feat(resilience): API robustness improvements with UI settings sync #208

Open
Kaguya-19 wants to merge 4 commits into OpenBMB:main from Kaguya-19:feat/robust-api-clean

Conversation

@Kaguya-19

Collaborator

Summary

  • Error classification: Canonical error codes with user-friendly hints (userHint), SettingsFix suggestions, and reordered pattern matching to prevent misclassification (rate_limit/billing before context_overflow)
  • API resilience: Configurable retry with exponential backoff, Retry-After header/message parsing, stream idle timeout, circuit breaker (ProviderHealthTracker), mid-stream rate-limit recovery, and retryProgress event broadcasting
  • Fallback eligibility: Non-retryable errors (billing, model_not_found, auth_error) now eligible for provider fallback
  • UI settings sync: transientRetry panel and per-provider retry config in Advanced sections, userHint error rendering with hint icon, structured retryProgress live status display, i18n keys (en + zh-CN)

Commits (4)

  1. feat(errors): user-friendly error classification with actionable hints
  2. fix(errors): reorder pattern matching and harden edge cases
  3. feat(resilience): remote API robustness (retry, circuit breaker, stream idle timeout)
  4. feat(ui): sync robust-api settings to frontend (settings panels, error hints, retry progress)

Changed files (31 files, +969/-54)

Backend (20 files)

  • src/model/errors/normalizeModelError.ts — semantic error classification, sanitizeErrorMessage, resolveUserHint
  • src/model/protocol/errors.ts — canonical error codes, regex patterns, parseRetryAfter*
  • src/model/streaming/streamModel.ts — configurable retries, stream idle timeout
  • src/router/RouterRuntime.ts — ProviderHealthTracker integration, mid-stream continuation, retry progress events
  • src/router/health/ProviderHealthTracker.ts — circuit breaker (healthy/degraded/open/half_open)
  • src/router/fallback/runFallbackChain.ts — fallback-eligible non-retryable codes
  • src/gateway/client/InProcessGateway.ts — broadcastRetryProgress, userHint passthrough
  • src/gateway/protocol/types.ts — GatewayEvent userHint field
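
The four circuit-breaker states listed for ProviderHealthTracker can be illustrated with a minimal sketch. This is not the PR's implementation: the class name `SimpleHealthTracker`, the option names, the thresholds, and the exact transition rules are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch: names, thresholds, and transitions are assumptions,
// not the PR's ProviderHealthTracker implementation.
type HealthState = "healthy" | "degraded" | "open" | "half_open";

interface BreakerOptions {
  degradedAfter: number; // consecutive failures before "degraded"
  openAfter: number;     // consecutive failures before "open"
  cooldownMs: number;    // time spent "open" before a probe is allowed
}

class SimpleHealthTracker {
  private readonly opts: BreakerOptions;
  private failures = 0;
  private openedAt = 0;
  private state: HealthState = "healthy";

  constructor(opts: BreakerOptions) {
    this.opts = opts;
  }

  // Returns the current state, moving open -> half_open once the cooldown elapses.
  getState(now: number): HealthState {
    if (this.state === "open" && now - this.openedAt >= this.opts.cooldownMs) {
      this.state = "half_open";
    }
    return this.state;
  }

  // Requests are blocked only while the breaker is open.
  canRequest(now: number): boolean {
    return this.getState(now) !== "open";
  }

  // Any success resets the failure count and closes the breaker.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "healthy";
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // A failed probe in half_open reopens the breaker immediately.
    if (this.state === "half_open" || this.failures >= this.opts.openAfter) {
      this.state = "open";
      this.openedAt = now;
    } else if (this.failures >= this.opts.degradedAfter) {
      this.state = "degraded";
    }
  }
}
```

Passing `now` explicitly instead of reading the clock inside the class keeps the sketch deterministic; a router could skip any provider for which `canRequest` returns false and move on to the next fallback candidate.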

Frontend (11 files)

  • PilotDeckConfigTab.tsx — transientRetry panel + provider retry Advanced section
  • MessageComponent.tsx — userHint amber hint box
  • MessagesPaneV2.tsx — retryProgress live status step
  • pilotdeck-bridge.js — retry_progress + userHint passthrough
  • i18n {en,zh-CN}/{settings,chat}.json — translation keys

Test plan

  • tsc --noEmit — 0 new errors
  • vitest run — 12 files / 65 tests passed
  • vite build — 3372 modules compiled successfully
  • i18n JSON validation — all 4 files valid
  • Dev server smoke test — Settings page renders correctly

Made with Cursor

Kaguya-19 and others added 4 commits June 12, 2026 14:50
…hints

- Extend CanonicalModelErrorCode with billing, model_not_found,
  context_overflow, image_too_large, payload_too_large
- Add userHint + settingsFix fields to CanonicalModelError for
  actionable user-facing guidance on every classified error
- Expand error pattern matching (20+ patterns) covering Ollama,
  llama.cpp, vLLM, Bedrock, Chinese error messages, etc.
- Add 402 disambiguation: billing exhaustion vs transient rate limit
- Add sanitizeErrorMessage: extract <title> from HTML error pages,
  normalize whitespace, truncate overly long messages
- Propagate userHint through AgentError and classifyModelError
- Make billing/model_not_found/auth_error fallback-eligible
- Teach ContextOverflowRecovery about context_overflow + image_too_large

Co-authored-by: Cursor <cursoragent@cursor.com>
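
The three steps described for sanitizeErrorMessage (extract the `<title>` from an HTML error page, normalize whitespace, truncate) might look roughly like the sketch below. The function name `sanitizeErrorMessageSketch`, the 300-character default, and the `...` truncation marker are assumptions; the PR's actual limits are not shown in this description.

```typescript
// Hypothetical sketch; the default limit and truncation marker are assumptions.
function sanitizeErrorMessageSketch(raw: string, maxLength = 300): string {
  // If the body looks like an HTML error page, prefer its <title> text.
  const title = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(raw);
  let text = title?.[1] ?? raw;
  // Collapse runs of whitespace (including newlines) into single spaces.
  text = text.replace(/\s+/g, " ").trim();
  // Truncate overly long messages, marking the cut with "...".
  if (text.length > maxLength) {
    text = text.slice(0, maxLength - 3) + "...";
  }
  return text;
}
```

For example, a proxy's HTML `502 Bad Gateway` page would be reduced to its title text rather than shown to the user as raw markup.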
1. classifySemanticError: move RATE_LIMIT and BILLING patterns before
   CONTEXT_OVERFLOW to prevent "input tokens per minute" being
   misclassified as context overflow.
2. statusCodeToCode 402: check BILLING_PATTERN first so explicit
   billing exhaustion messages are never mistaken for transient
   rate limits (avoids futile retries).
3. DefaultContextRuntime inline fallback: align with
   ContextOverflowRecovery — handle image_too_large and
   context_overflow codes, check recoverableViaCompact flag.

Co-authored-by: Cursor <cursoragent@cursor.com>
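
The ordering fix in items 1 and 2 above can be sketched as follows. The regexes here are simplified stand-ins, not the PR's patterns in src/model/protocol/errors.ts, and the names `classifySketch` and `classify402Sketch` are hypothetical. The point is only that a message such as "input tokens per minute" matches both a rate-limit pattern and a loose context-overflow pattern, so whichever check runs first decides the result.

```typescript
// Hypothetical sketch: these patterns are illustrative stand-ins,
// not the regexes used in the PR.
type SketchCode = "rate_limit" | "billing" | "context_overflow" | "unknown";

const RATE_LIMIT = /rate.?limit|too many requests|tokens per minute/i;
const BILLING = /insufficient (balance|credits|quota)|billing|payment required/i;
const CONTEXT_OVERFLOW = /context (length|window)|maximum context|too many tokens|input tokens/i;

// Rate-limit and billing checks run before context overflow, so
// "input tokens per minute" is not misread as a context overflow.
function classifySketch(message: string): SketchCode {
  if (RATE_LIMIT.test(message)) return "rate_limit";
  if (BILLING.test(message)) return "billing";
  if (CONTEXT_OVERFLOW.test(message)) return "context_overflow";
  return "unknown";
}

// HTTP 402 disambiguation: an explicit billing message wins, so the caller
// does not retry a request that cannot succeed; otherwise treat it as a
// transient rate limit.
function classify402Sketch(message: string): SketchCode {
  return BILLING.test(message) ? "billing" : "rate_limit";
}
```

With the context-overflow check first, the message "Limit exceeded: 40000 input tokens per minute" would match the `input tokens` alternative and trigger compaction instead of a retry.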
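1. Parse retry hints from the Retry-After HTTP header and from error messages
2. Stream idle timeout (default 5 minutes) to keep a connection that looks alive but has stalled from hanging forever
3. Configurable per-provider retry policy (provider.retry now takes effect)
4. Mid-stream 429 retry: resume from the checkpoint instead of terminating outright
5. Retry progress is visible to the user (Reconnecting... 2/5)
6. Provider health tracking (simple circuit breaker: healthy/degraded/open)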

Co-authored-by: Cursor <cursoragent@cursor.com>
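
Item 1 above, combined with the exponential backoff mentioned in the summary, could be sketched as below. All function names, the message regex, and the rule for combining a server hint with the backoff delay are assumptions for illustration; the PR's parseRetryAfter* helpers are not shown in this description. The Retry-After header itself is defined by HTTP as either a number of seconds or an HTTP-date, which is what the first function handles. Jitter is omitted to keep the sketch deterministic.

```typescript
// Hypothetical sketch; names and combination rules are assumptions.

// Parse a Retry-After header value: either delay seconds or an HTTP-date.
function parseRetryAfterHeaderSketch(value: string, nowMs: number): number | undefined {
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  // A date in the past means "retry now".
  return Math.max(0, dateMs - nowMs);
}

// Extract hints such as "try again in 20s" or "retry after 1.5 seconds".
function parseRetryAfterMessageSketch(message: string): number | undefined {
  const m = /(?:try again|retry) (?:in|after) (\d+(?:\.\d+)?)\s*(ms|s|sec|seconds?)\b/i.exec(message);
  if (!m) return undefined;
  const amount = Number(m[1]);
  const unit = (m[2] ?? "s").toLowerCase();
  return Math.round(unit === "ms" ? amount : amount * 1000);
}

// Exponential backoff (base * 2^(attempt-1)), capped at maxMs. When the
// server supplied a hint, wait at least that long, still capped at maxMs.
function retryDelayMsSketch(attempt: number, baseMs: number, maxMs: number, hintMs?: number): number {
  const backoff = Math.min(maxMs, baseMs * 2 ** (attempt - 1));
  return hintMs === undefined ? backoff : Math.min(maxMs, Math.max(backoff, hintMs));
}
```

A caller would try the header first, fall back to the message hint, and pass whichever is defined as `hintMs`; whether a long server hint should be capped or should abort the retry loop is a policy choice this sketch does not settle.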
- Add transientRetry panel in RouterSection Advanced area
- Add per-provider retry config in ProviderCard Advanced area
- Passthrough userHint from gateway → bridge → chat error render
- Add retryProgress structured rendering in live status step
- Add GatewayEvent userHint type field
- Add i18n keys for transientRetry and provider retry (en + zh-CN)

Co-authored-by: Cursor <cursoragent@cursor.com>